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1 Abstract 



We extend the mixtures of Gaussians (MOG) model to the projected mixture of Gaussians (PMOG) 
model. In the PMOG model, we assume that q dimensional input data points Zj are projected by 
a q dimensional vector w into 1-D variables U{. The projected variables Ui are assumed to follow 
a 1-D MOG model. In the PMOG model, we maximize the likelihood of observing m to find both 
the model parameters for the 1-D MOG as well as the projection vector w. First, we derive an 
EM algorithm for estimating the PMOG model. Next, we show how the PMOG model can be 
applied to the problem of blind source separation (BSS). In contrast to conventional BSS where 
an objective function based on an approximation to differential entropy is minimized, PMOG 
based BSS simply minimizes the differential entropy of projected sources by fitting a flexible MOG 
model in the projected 1-D space while simultaneously optimizing the projection vector w. The 
advantage of PMOG over conventional BSS algorithms is the more flexible fitting of non-Gaussian 
source densities without assuming near-Gaussianity (as in conventional BSS) and still retaining 
computational feasibility. 



2 Introduction 

The mixture of Gaussians (MOG) is a flexible model with application to many real world problems. 
The adjustable parameters in a 1-D MOG model include the number of component Gaussian 
distributions, the mean and variance of each component distribution and their mixing fractions. 
Given 1-D data points Uj, these distributional parameters can be efficiently estimated by maximum 
likelihood (ML) using the expectation maximization (EM) algorithm [7). Now consider the following 
situation: 

• Data points U{ are not given directly but suppose that we are given vectors Zi. Next, 1-D 
scalar variables are generated by U{ = w T Zi where w is an unknown projection vector. 

• Suppose that projected variables u% follow a 1-D MOG model which we will refer to as a 
projected mixture of Gaussians (PMOG) model in this work. 

Can we estimate the PMOG distributional parameters as well as the projection vector w using 
ML? In particular, can we derive an EM algorithm for this PMOG model similar to the standard 
EM algorithm for the conventional MOG model? 

While estimating the PMOG model is an interesting problem in its own right, we will show that 
it is also closely related to the problem of estimating the differential entropy of a random variable. 
Given this fact, we will show that the PMOG model can be applied to the problem of linear blind 
source separation (BSS) and linear independent component analysis (ICA). We use the term ICA to 
refer to "square mixing" where there are equal number of latent sources and mixed signals and BSS 
to refer to "non-square and noisy" mixing where there are more mixtures corrupted with additive 
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Gaussian noise than latent sources. From this point of view, ICA is a special "noise free" case of 
BSS. 

BSS or ICA is a well studied problem and we refer the reader to [5j \TE[ 2] for a detailed overview of 
relevant work. A central point in BSS is choosing the "measuring function" for differential entropy of 
a random variable. One of the most widely used algorithm for BSS is the FastICA (FICA) algorithm 
|13j . The FICA algorithm works by optimizing the differential entropy based contrast functions 
developed in the seminal work by Hyvarinen et al. |12| . These contrast functions are approximations 
to the differential entropy of a random variable x under the following conditions: 

• Expected values of certain functions Gi{x) are given. The density of x is estimated to be the 
maximum entropy distribution (MED) f med (x) subject to the these constraints. 

• Most importantly, the assumed MED f med (x) is, in the words of Hyvarinen et al. [12J "not 
very far from a Gaussian distribution". Let us denote this simplified form of f med (x) by 

fmed f T \ 
J Gaussian\ I 

Hyvarinen et al. calculated expressions for the differential entropy using the distribution fGaussian( x ) 
given flexible user defined functions Gi(x). A key question that arises is: are these approximations 
to differential entropy and BSS solutions based on fc;aussian( x ) adequate when the true density and 
hence f med (x) is not "near Gaussian"? 

In another seminal paper, Attias et al. pQ developed a general solution to the BSS problem where the 
latent source density was modeled as a "factorial MOG" density. This significant advance removed 
the "near Gaussianity" assumption on the latent source densities. In addition, [1] developed an 
EM algorithm for the ML solution of the mixing parameters and MOG source parameters in BSS 
followed by a posterior mean or maximum aposteriori (MAP) estimate of the sources. However, 
this algorithm becomes computationally intractable for > 13 sources. Moreover, the solution by 
Attias et al. assumes "exact independence" between the sources i.e., it does not allow any partial 
dependence between sources. Can we develop a solution to the BSS problem that retains the flexible 
latent density modeling of jT], retains computational tractability and can be applied under partial 
dependence between sources? 

In this work, we develop the PMOG model and then apply it to the problem of BSS to address 
these questions: 

1. We describe the PMOG model and derive an EM algorithm for estimating its parameters in 
section [4l 

2. We show how PMOG can be applied to solving a BSS problem including cases where partial 
dependence between sources is allowed in sections [6], [7j 
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3 Notation 



• Scalars will be denoted in a non-bold font (e.g. 7, 9, L) possibly with subscripts (e.g. ttj,, 
/ik). We will use bold face lower case letters possibly with subscripts to denote vectors (e.g. 
H, £C,Zi) and bold face upper case letters possibly with subscripts to denote matrices (e.g. 
A, £, £?i). The transpose of a matrix A will be denoted by A T and its inverse will be denoted 
by A~ l . We will denote the p x p identity matrix by I p . A vector or matrix of all zeros will 
be denoted by a bold face zero whose size should be clear from context. 

• The jth component of vector Xi will be denoted by Xij whereas the jth component of vector 
x will be denoted by Xj. The element of matrix A will be denoted by A(i,j). Estimates 
of variables will be denoted by putting a hat on top of the variable symbol. For example, 
an estimate of a 2 will be denoted by a 2 . The 2-norm of a p x 1 vector x will be denoted by 

Ni2=+Vn=i*?- 

• If a; is a random vector with a multivariate Normal distribution with mean \x and covariance 
5] then we will denote this distribution by N {x | /x, £). Similarly, if u is a scalar random 
variable with a Normal distribution with mean fi^ and variance a\ then we will denote this 
distribution by M (u \ /i^ , a^) . We will use U(a, b) to denote the uniform random distribution 
on (a, b). Unless otherwise stated all logarithms are natural logarithms (i.e., log means log e ). 
Probability density functions will be denoted by capital letters with the vector of arguments in 
parenthesis such as P(x) or Q{x). When necessary we will also indicate the dependence of a 
probability density on a vector of parameters 6 by using the notation P(x \ 6). The expected 
value of a function f(x) with respect to the density P(x) will be denoted by E x [f{x)\ or 
simply by E[f(x)]. 

4 Projected mixture of Gaussians (PMOG) model 

In this section, we try to answer the following questions: 

• What is the PMOG model? What is given and what needs to be estimated? 

• Can we derive an EM algorithm for estimating the PMOG model? 

• How are the MOG parameters and projection vector estimated in the M-step? 

• What other precautions need to be taken to make the monotonic likelihood improvement 
property of EM hold true? 
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4.1 What is the PMOG model? 



Suppose we are given n vectors z±, z%, . . . , z n where each Zi is a q X 1 vector. Suppose Z is a q x n 
matrix formed by assembling these vectors into a matrix i.e., 



[z%,Z2, ■ ■ ■ ,Z r 



(4.1) 



We can think of these vectors as n realizations of a random vector z. Suppose w is an unknown 
q x 1 vector that defines new "projected" scalars u±,U2, ■ ■ ■ ,u n such that 



T T 
Ui = W Zi = Zi w 



(4.2) 



Again, these n scalars can be thought of as n realizations of the random variable u = w T z. Suppose 
we are also given a q x L matrix G of full column rank L with q > L. It is given that the unknown 
vector w satisfies the constraints 



w T w = 1 
G T w = 

It is assumed that the density of random variable u is a mixture of R Gaussians i.e., 

R 



P(u I 7T,/i,er 2 ) = ^7T fe AA (u | Hk^l) 



(4.3) 
(4.4) 



(4.5) 



fc=l 



In the above equation, tv is a R x 1 vector [tti,7T2, ■ ■ ■ ,t^r] T - Similarly /i = [/ii,/X2, 
<r 2 = [af, • • • , are also fix 1 vectors. The class fractions tt^ satisfy: 



R 

J> fc = 1 

fe=i 

< VTfc < 1 



,^i?] T and 

(4.6) 
(4.7) 



Equation 4.5 can be equivalently written in terms of z as follows: 



P(w T z | 7T, /i, er 2 ) = 7Tfc (io T 2: 



2^ 



k=l 



(4.8) 



We will call the model 4.8 a projected mixture of Gaussians or PMOG model. A pictorial description 
of the PMOG model is given in Fig. [l] 
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Vector w is unknown 



Input data are n vectors of size q x 1 
Z\ z 2 z 3 ■ ■ ■ 



4 



gxl projection vector }> Projected points 



Model usin 



R component PMOG model 



Empirical distribution of w T Zi over i 



Zi = q x 1 input vectors 
w = q x 1 projection vector 
w T Zi = 1-D projection of vector Zj using w 
7r = class fraction vector in PMOG 
Li = mean vector in PMOG 
er 2 = variance vector in PMOG 



P (w T Z | 7T, fX, (T 2 ) = 1TkJ\f (w T Z | /i/t, (T, 



I I 

Dirichlet prior on n\ P(n | (3) Inverse Gamma prior on tr 2 : P(a 2 \ 0,-f) 



max F(tt, ft. a 2 ,w) = P(w T z lt w T Z2, ... , w T z„ | n, fi,cr 2 ) P(tv \ f3) P[a 2 \ 6, -y) 

7r.jz,CT 2 .ID 



EM algorithm to maximize F(tv, ii, a 2 ,w) 
PMOG objective function ^ 

Optimal 7r,/i,(T 2 and w 

Figure 1: Pictorial depiction of the PMOG model. Input vectors are projected into 1-D variables 
using projection vector w. The 1-D empirical density of projected points w T zi is modeled using a 
MOG density. In contrast to conventional MOG, the PMOG model requires estimation of both the 
MOG parameters n, fj,, cr 2 as well as the projection vector w. The overall objective function F also 
includes the influence of priors for 7r and a 2 . This is done to prevent the collapse of a component 
density of PMOG onto a single point. 
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4.2 Estimating the PMOG model 

Assuming that w T Zi for i = 1,2, ... ,n are n independent realizations of w T z, we can write their 
joint density as 



P(W T Z 1 ,W T Z 2 , . . . , W T Z n | 7T, fl, a 2 ) = Y[ P(w T Zi | 7T, /X, <J 2 ) 



(4.9) 



To simplify notation, we will denote the left hand size of 4.9 by P(Z w | 7r,/x, cr 2 ). With this 
notation, we can write: 



P(Z T W | 7T, fl, a 2 ) = Yl P(w T Zi | 7T, /X, <T 2 ) 



(4.10) 



The problem is to maximize P(Z T w \ iv, fi, a 2 ) (equation 4.10) w.r.t 7r, fi, a 2 and w. For a given 
w this problem is equivalent to the standard mixture of Gaussians (MOG) estimation problem and 
can be handled efficiently by the expectation maximization (EM) algorithm [7\. In the PMOG 
model, the novelty is to allow w to be an unknown that is estimated along with MOG parameters 
7r, /x, a 2 . Next we introduce priors on 7r and a 2 . The purpose of introducing priors on 7r and cr 2 
is simply to prevent the collapse of a Gaussian component from PMOG onto a single data point. 
Suppose we assume a Dirichlet prior on the vector 7r. This prior is described by a R x 1 parameter 
vector /3 with elements f5\, fo, ■ ■ ■ , /3r. 



r 



P(ir\j3)<x]l4 



(4.11) 



k=i 



Similarly we assume a product of inverse Gamma prior on cr 2 . Suppose 6 is a vector with elements 
61,62, ... ,0r and similarly 7 is a vector with elements 71, 72, • • • , 7i? then 

R R 

2\-(0fe+l) 

k=i k=i 



P(a 2 I 0,7) = f[P(4 I k ,lk) oc J] (4) 



cxp 



1 



(4.12) 



We use an improper prior for //,, P([i) = 1. With this choice of priors the posterior distribution 
can be written as: 

P(Z T w I 7r,/x,cr 2 ) P(tt I 0) P(a 2 \ G,j) 



P(lV,fl,(T A I Z 1 W) 



(4.13) 



P(Z T w) 

Maximization of this posterior density w.r.t both 7r, fi, a 2 and w is difficult because of the presence 
of w in the denominator. We could however, maximize the posterior density w.r.t 7r,/x, cr 2 only 
and maximize the likelihood term P(Z T w \ 7r,/i,er 2 ) w.r.t w. This is equivalent to solving the 
problem: 

max F(ir, fi, a 2 , w) = P(Z T w \ ir, fi, a 2 ) P(tv \ (3) P(cr 2 | 0, 7) | (4.14) 
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4.3 EM algorithm for maximizing F 



The main development in this subsection is an EM algorithm for maximizing F. It is much easier 
to deal with the logarithm of F rather than F itself. Now 

log F(tt, fi, a 2 ,w) = log P(Z T w | 7r,/x,cr 2 ) +logP(7r | (3)+ log P(a 2 \ 0,j) (4.15) 

Substituting |4~Tol \4~U\ and [432] into [415] we get: 

logF(7r,/2,cr 2 ,™) (4.16) 



R 



R 



oc log ( np(w T Zi i vr^,^) ) +io g ( n^ _i ) + io § ( n kt (%+1) ex P 



U=l 



Wl=l 



vfc=l 



which can be simplified to: 
logF(7r,/2,cr 2 ,™) 



(4.17) 



n K f 1 \ 

oc lQ g P(w T Zi | 7r,/x,cr 2 ) + J^(/3 fc - l)logvr fe + ^(-(^ + l)log^ f J 

i=l fc=l fc=l ^ ^ kG kJ 

In the above equation, we have ignored the constants of the prior densities on tv and <x 2 since they 
do not depend on the unknown parameters. Thus the objective function to be maximized can be 
written as: 

H(n, fi,cr 2 ,w) = Hx(ir, /Li, a 2 , w) + H 2 { / K,a 2 ) 



where 



and 



Hi(-k, /x, a 2 , w) = 1°§ F{w T Zi | 7r,/x, a' 



i=l 



H 2 (tt,* 2 ) = V(/3 fe -l)log7r fe + V(-(0 fc + l)log<T 2 -^V) 



n _R 



Substituting |4.8| into |4.19| we get: 

iii(7r,/x,er 2 , w>) = ^log^7r fc jV(iw T «i | /x^cr 2 ,) 
i=l fc=l 

Introducing auxiliary variables a^j such that: 

R 

fe=i 

< a fci < 1 



(4.18) 
(4.19) 

(4.20) 
(4.21) 



(4.22) 
(4.23) 
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we can write 



R 



rr i 2 V^, sr^ I {w T Zi\ ii k ,al) 



i=l k=l \ 

By using the concavity of the log function, we can write: 

n R 

Hi(it,n,a 2 ,w) > Q(TT,fi,cr 2 ,w,a) = ^^a fci log 

1=1 k=l 



OLki 



ot-ki 



TT k N (w T Zi | Hk,crl) 



From l4l8l we see that 

H(7r,n,a 2 ,w) = Hi(7v,fi,(T 2 ,w) + H 2 (n,a 2 ) > Q(n, /z, <r 2 , tu, a) + H 2 (tt,<t 2 ) 
The rightmost inequality becomes an equality when we choose: 

TTfcTV (w T Zi | Hk,<7k) 



OLki 



Efc^^A/ ' {w T Zi I ^ifc, cr|) 



(4.24) 



(4.25) 



(4.26) 



(4.27) 



In the context of EM, the are also known as responsibilities. Suppose ir®, /z^, a , are 
the parameter values at iteration t and be the corresponding responsibilities computed from 
14.271 Then we have 

H (vr^ 1 ) , , <x 2(m) , w^ t+1 A (4.28) 

= Q (7v( t+1 \^ t+1 \<T 2{t+1 \w( t+1 \ a ( t+1 A + H 2 (7v( t+1 \a 2{t+l) ) (by p6]andp7} 



>Q[^>,n 



„.(*+!) „.2(*+l) ^(t+1) „(*) 



aw + H 2 tt 



(by 4.26 and 4.27) 



Now suppose, given -k^\ fj,^ , <r , and a®, we calculate ir^ t+1 \ ^ t+1 \ <r +1 \ k/* +1 ) such 
that 

Q (#V (H V (W) ,tt (m U (t) ) +^2 (ttC+V 2 ^) (4.29) 
> Q (ttW, /xW <r 2(t) , a W) + tf 2 (ttW, a 2(i) ) 



From 4.29 and 4.28 we get: 



F(7r('+V (m U 2( ' +1 V m) 



> Q ( tt W , //« , a 2 {t) ,«WaW]+H 2 (irW*7 



(4.30) 



rW ~ 2W 



(by 4.26 and 4.27) 



This shows that the objective function H increases monotonically from iteration t to iteration (i + 1) 
and hence converges to a local maximum. This gives us the following EM algorithm for the PMOG 
model: 



Initialization Choose initial values of parameters either randomly or by other techniques such 
as clustering using k- means |17j . The output of this stage is the initial values 

Following initialization, the E and M steps outlined below are performed in an alternating fashion 
until convergence. 



E-step In the E-step, we calculate th e resp onsibilities using the current estimate of param- 
eters 7r(*) <r 3 and equation 



4.27 



M-step In the M-step, we solve the problem: 



7T 



(*+!) „(*+!) rr 2{t+1) «,(*+!) 



arg max Q ( it, /x, <t 2 , w, or*' J + H2 (tt, <x 2 ) (4-31) 

TT,fi,cr 2 ,w V 



In ordinary MOG, the objective function in 4.31 is a convex function of n,fi,cr 2 and a closed 
form solution exists for the M-step. If optimizing also w.r.t w then the objective function becomes 



non-convex and hence we need to explicitly impose the post optimization condition 4.29 for EM 
convergence. We use a relative error criterion to detect EM convergence. Our convergence criterion 
is: 



abs 



H(^ t+1 \ <x 2(m) , w^) - h(tt« m w <r 2{t \ «,(*); 



e* = e re i I mean abs 



V t 



(4.32) 
(4.33) 



Here e re i is a user specified relative error. We used e re i = 10 in our experiments 



4.4 Solving the M-step problem 



In this subsection, we discuss in detail the solution of problem 4.31 
variable 7r, /x. 



During the estimation of each 



and w we account for the relevant constraints using Lagrange multipliers and 
Karush-Kuhn- Tucker optimality conditions (see [19J for details). 



Estimating 7r Differentiating the M-step objective function w.r.t Tik and noting the constraint 
Ylk=i n k — 1 = the optimality condition is given by: 

£Q (», „, o>,v,, ««>) + » H, „.) _ A 8 (f; „ _ 1} = „ (4 .34) 
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Now, 



d 
dir k 



Q ( 7r, /x, <x 2 , to, a 



(*) 



1 n 

-E 



(*) 



5 



1 

TTfc 



(jS* - 1) 



From 4.34 and 4.35 we get: 



?=1 \i=l 



Imposing the constraint E&Ll ^fc = 1 and noting that EaLi a E = 1 we get: 

R / n \ R 

A = E E4? + (a - x ) )=> A = - x ) 

fc=l \i=l / fc=l 



Thus 7T/C is given by: 



7U- 



Er =1 «2? + (a - 1) 



(4.35) 
(4.36) 



(4.37) 



(4.38) 



(4.39) 



* + E£=i(&-1) 

We have ignored the inequality constraints < Tit < 1 on 7Tfe during optimization. By choosing 
> 1 we can ensure that these constraints are always satisfied and inactive and hence can be 
disregarded during optimization. 



Estimating /x Differentiating the M-step objective w.r.t and noting that H2 (7r,<x 2 ) is not 
dependent on fj,^, the optimality condition is given by: 

Q (7^, fi, <r 2 , w, cx^^l = (4.40) 



Upon substituting the derivative, we get 



± a <.)((^S^)) = „ (4 . 41) 



i-1 ■ 



Upon simplifying, ^ is given by: 



Mfc = „ * m ' (4-42) 



Ei=i «. 



11 



Estimating a 2 Differentiating the M-step objective w.r.t of the optimality condition is: 

^2 Q ^ (T ' 2 - W - a(t) ) + ^2 R 2 K °*) = 

Now, 

(t) J (w T Zj - //fc) 2 1 



fc i=l 



2^ 



2*2 



and 



3 



53* (-,-*) 



+ 1) 

2 4 



Substituting |4.44| and |4.45| in |4.43| we get: 

(t) / {w T Zj - fi k ) 2 1 



E« 

i=i 



Am' 



2^ 



Upon simplification, we get: 



2a 2 k j rrr 



2(0, + 1) + JXi 41 



a 



(4.43) 

(4.44) 
(4.45) 

(4.46) 



(4.47) 



Estimating w The Lagrangian function for optimizing w is given by: 

£(w, Ai, A 2 ) = Q (it, /Lt, cr 2 , w, q w ) - {Ai(w T w - 1)} - {\ 2 T G T w} (4.48) 



Since must satisfy the constraints 4.3, and since H2 (ir, <x 2 ) does not depend on w the optimality 
condition is given by: 



— ( 
dw 



TV, fx, a 2 , w, o;W 



d_ 
dw 



{M 



T 

W W 



In the above equation, Ai is the Lagrange multiplier for the constraint w T w = 1 and A2 is a q x 1 
Lagrange multiplier vector for the constraint G T w = 0. Substituting the derivatives, we can write 
the optimality condition as: 



n R 

EE 



o 



(*) 

ki 



(w T Zi - n k )zi 



2 Ai w - G A 2 = 



i=i fc=i v k 
Upon simplification, we can write the optimality condition as: 

b- Aw -2Xiw -G\ 2 = 



(4.50) 



(4.51) 
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where the vector b and matrix A are independent of w and are given by: 



n R (t) 



i=l fc=l 



(4.52) 



and 



n _R (t) 



i=l fc=l a fc 



(4.53) 



Premultiplying both sides of 4.51 by w T and noting the constraints on w given by 4.3 

w T b - w T Aw - 2Ai(l) -0 = 
In other words, the Ai is given by: 

Ai = - (w T b — w T Aw) 



we get: 



(4.54) 



(4.55) 



Similarly, premultiplying both sides of 4.51 by G T and noting the constraint G T w = we get: 

G T (b- Aw) -2X^0) - {G T G)X 2 = (4.56) 

Note that G is of size q x L with q > L and of full column rank L. This means that G T G is 
non-singular and invertible. Hence we can solve for A2 to get: 



A 2 = (G G)~~ G (b-Aw) 



(4.57) 



Substituting Ai from 4.55 and A2 from 4.57 into 4.51 we can re-write the optimality condition 
as: 



b — Aw — (w r b — w 1 Aw) w — G (G T G)~ 1 G T (b — Aw) = 



Let 



P G = I q -G (G T G)- 1 G T 



(4.58) 



(4.59) 



Essentially Pq is an orthogonal projector to the columns of G. Thus the optimality condition for 
w can be simplified to: 

P G (b - Aw) - {w T b - w T Aw) w = (4.60) 
Since w is a q x 1 vector, for fixed b and A this is a system of q cubic equations in q variables. 
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4.5 Breaking up the M-step into two parts 



Suppose we have already estimated the current respobsibilities a''' in the E-step. Our goal is to 
simultaneously solve the M-step equations for 7T, /x, <x 2 and w. Our strategy for solving the M-step 
equations is to note that with w fixed, 7r,ix and a 2 can be solved explicitly using 4.39, 4.42 and 
4.47 respectively. While for fixed it, /x and <x 2 (i.e., fixed b and A), we can calculate w by solving 



a system of cubic equations 4.60 Hence we can break up the M-step into 2 parts as follows: 



1. M-step, Part 1: 

In the first part, we keep w fixed at its current value w* and solve the following maximization 
problem: 

7r* , fi* , <x 2 * = arg max Q ( 7r, /x, a 2 , w* , ofi^ ) + H?. (tt, <t 2 ) (4.61) 



This is equivalent to solving for 7r, fi and <r using 4.39 



4.42 



and 



4.47 



M-step, Part 2: 

In the second part, we keep tt,(jl and a 2 fixed at their current values 7r*,/z*,<x 2 * and solve 
the maximization problem: 



w* = arg max Q ( 7r*, /x*, <t 2 *, w, a (t> ) + H2 ( 7r* , a 



(t) 



.* _2* 



(4.62) 



This is equivalent to solving the system of cubic equations 4.60 for w. 



We repeatedly perform the alternating maximizations in part 1 and part 2 until the absolute value 
of the change in H{it, /x, <r 2 , w) from one alternating maximization to the next is below a user 
specified tolerance em- We used sm = 10~ 3 in out experiments. 

There is a however a complication that needs to be taken care of when implementing the overall 
EM step. Note that when excluding w, the M-step objective has closed form solutions for 7r,/x 
and a 2 . In addition, we can show that these closed form solutions result in a maximization of the 
partial M-step function with fixed w. Can we say the same thing about maximizing w for fixed 
7r, fx and cr 2 ? Does the 2nd part of M-step for optimizing w converge to a maximum? Suppose w* 
is a solution to 4.60 The Hessian of the Lagrangian |4.48 is given by: 

72 



Vi w £{w*,\$,\ 2 *) = -A-2\* 1 I q 



(4.63) 



Here, is simply the Lagrange multiplier from 4.55 evaluated at w*. Note that A is a symmetric 
and positive definite matrix. If > then we can guarantee that V 2 ^ C{w* , A*, A2*) will be 
negative definite and the 2nd part of M-step will have found a local maximum. How can we be sure 
that A^ > 0? How can we ensure that the M-step solution for 7r, fi, a 2 and w is admissible? 



One way to solve this problem is to check 4.29 explicitly after convergence of each overall M-step. If 
4.29 is not satisfied then we re-initialize the M-step with a random vector w that satisfies G T w = 
and 1 1 -it? 1 1 2 = 1 and re-solve the 2 part M-step. This implies we are simply checking for an increase 
in the objective function 4.31 after each M-step. 
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4.6 Solving for the projection vector 

Solving for w is equivalent to finding zeros of the set of equations: 

f(w) =P G (b- Aw) - (w T b - w T Aw) w (4.64) 



that satisfy the constraints in 4.3. We consider below three possible cases: 



4.6.1 Case 1 

Suppose (b — Aw) = 0. Note that if w satisfies (6 — Aw) = then equation f(w) = is satisfied. 
Since A is invertible, define a candidate solution for the second part of the M-step as: 

w 1 = A 1 b (4.65) 

Is this an acceptable solution? If w\ satisfies w\ T w\ = 1 and G T w^ = then it is an acceptable 
solution. However, there is no guarantee that this will be true for w\ satisfying |4.65| Thus the 



solution 4.65 for w\ is not admissible. 



4.6.2 Case 2 

Suppose (w T b — w T Aw) = 0. If (w T b — w T Aw) = then solving f(w) = reduces to solv- 
ing: 

P G (b - Aw) = (4.66) 
Another candidate solution for the second part of the M-step is given by: 

w 2 = A' 1 (b - Grj) (4.67) 

where rj is an arbitrary L x 1 vector. Does this w 2 satisfy the (L+l) di stinct constraints W2 T w 2 = 1 
and G T w 2 = 0? The number of variables on the right hand side of 4.67 is L but the number of 



constraints that they have to satisfy is (L + 1) (constraints on w 2 ) and so the solution 4.67 for w 2 
is not admissible. 



4.6.3 Case 3 

Suppose (w T b — w T Aw) ^ and we solve for w that satisfies f(w) = 0. This means that w will 
satisfy: 

P G (b - Aw) = (w T b - w T Aw) w (4.68) 
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Will such a w satisfy w T w = 1 and G T w = 0? Premultiplying both sides of 
noting the definition of Pg in |4.59| we get: 



4.68 



Since (w b — w Aw) ^ 0, 



G T P G (b - Aw) = = (w T b - w T Aw) G T w 
9 gives us: 

G T w = 

Premultiplying both sides of 4.68 by w T we get: 

w Pq [b — Aw) = (w b — w Aw) w w 
From 4.59 and 4.70 we know that w T Pq = w T and so 4.71 can be written as: 



w T (b — Aw 



Since {w T b — w T Aw) ^ 0, 



4.72 



I = (w T b — w T Aw) = (w T b — w T Aw) w T w 

gives us: 



w T w = 1 



by G T and 
(4.69) 

(4.70) 

(4.71) 

(4.72) 

(4.73) 



Therefore if (w T b — w 1 Aw) ^ then any solution to f(w) = will satisfy both w T w = 1 and 
G T w = 0. To solve for w such that f(w) = we can use Newton's method. The Jacobian of / 
w.r.t w is given by: 



J(w) = -P G A-wb T - (b T w) I q + (w T Aw) I q + 2 ww T A 
Assuming, J(w) is non-singular the basic Newton update is given by: 



w 



(i+l) 



w 



J w 



-1 



(4.74) 



(4.75) 



To avoid problems caused by intermediate singularity of J we can replace \j (w^\\ 1 by a modified 

version [«7 (w^) + rjlql 1 for a sufficiently large rj such that \J (w^} + T)I q ] is non-singular (see 
|19j for more details). For instance, we could use the Levenberg-Marquardt modification of the 
Newton's method which essentially does this replacement. In this case, the update equation for w 
can be written as: 



J (w^) +rjl, 



-i 



w^ ' = W K 

The projection vector w is initialized for the Newton iteration as follows: 

w init = p G A~ 1 b 



w 



init 



W 



init 



\w tnt ^ \ 1 2 



(4.76) 

(4.77) 
(4.78) 



The complete EM algorithm pseudocode is given in Fig. [2] 
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EM algorithm for estimating the PMOG model 

Require: q x n matrix Z, number of Gaussian components in MOG R, q x L matrix G (L < q), 
Dirichlet prior parameter vector (3 for 7r, Inverse Gamma prior parameter vectors 6 and 7 for 
a 2 and convergence tolerances e re i, em 
1: Select randomly such that G T w^ = and ||«/°'||2 = 1. Initialize the R x 1 vectors 
7r(°), fj,(°\ er 2 ^ using the A;-means algorithm on the projected points Z T w^ and set found = 
and t = 

while found = do 

E-step: Calculate the respo nsibili ties a^' using the current parameter estimates 
,te('' and equation 



9 

10: 
11: 
12: 

13 
14 
15 



16: 



To start the M-step, set w* = w 



4.27 







M-step, Part 1: Given qW and iu* optimize the M-step objective function w.r.t and 



,2. 



7T*, /x*, <t 2 * = arg max Q ( 7T, /i, rr 2 . o;^ j + i7 2 (7T, cr 2 ) 



(4.79) 



This problem has an explicit solution given by equations 4.39, 4.42 and |4.47 



M-step, Part 2: Given ac^' and 7r*,/i*,cr 2 * optimize the M-step objective function w.r.t 
w: 



w* = argmax Q ( n* , fi* , a 2 * . w. aW ) + 2 ( 7r 



(4.80) 



Solving 4.80 is equivalent to finding the roots of the equation 4.64 which can be done with 
Newton or quasi-Newton techniques as in 4.75| and 4.76 We initialize w as per 4.77 and 
solve 4.64 to give w* . Steps 1 and 2 are alternated repeatedly in each cycle until the abs olute 
value of the change in H(tt* , fi* ,a 2 * ,w*) from one cycle to the next is < em- Since 4.80 
is a non-concave maximization problem, the solution to |4.64 might converge to a minimum 
instead of a maximum. Hence we iterate as follows: 
validsolution = 
while validsolution = do 



if equation 4.29 is satisfied then 

validsolution = 1, set 7r(* +1 ) = tv* , ^ t+1 
else 

Initialize w* randomly such that G T w* 
re-solve the 2-part M-step 
end if 
end while 
If abs 



2(*+i) 



and 1 1 w 



a 2 *, 



w 



1. With this initialization 



the above equation e* = E re i (mean abs H(ir^\ (T \ 



< e* , set found = 1. In 



ti-t + 1 
17: end while 



)])■ 



Figure 2: EM algorithm for estimating the PMOG model. 

17 



5 The BSS problem 



In this section, we will try to answer the following two questions; 

• What is the BSS problem and why is its solution difficult? 

• What is the connection between BSS, differential entropy and source correlation? 
5.1 The linear BSS problem 

We consider the general version of a linear BSS problem. Vectors Xi S R p are generated from 
latent source vectors Si 6 R 9 as follows: 

Xi = /i, + Asi + £i (5.1) 

In this linear mixing model, A is the pxq mixing matrix with p > q, /x is the p x 1 mean vector and 
£i is the p x 1 noise vector with distribution N (ei | 0, & 2 I P ) • We assume without loss of generality 
that each component of Sj has zero mean and unit variance i.e., E(sij) = and E(s 2 j) = 1. Thus, 
the second order statistics of Si can be summarized as: 

E( Si ) = (5.2) 
E{s iSi T ) = S s (5.3) 

where X! s is the unknown q x q correlation matrix between the source components. Note that the 
diagonal elements of S s are ones (l's). No distributional assumptions are made on the density 
of vector Sj. One simply requires that the components of vector Sj are minimally dependent (or 
maximally independent) on each other in an information theoretic sense. Given n independent 
realizations of the vector Xi (generated from n independent realizations of sources Sj), the goal is 
to estimate the unknown sources Si when A, fi and the noise variance a 2 are unknown. 



5.2 Why is the BSS problem difficult? 

The BSS problem is difficult because the estimation of mixing parameters A, fj, and a 2 is coupled 
with the estimation of latent sources Sj. Note that the equation 5.1 gives us the conditional density 
of Xi given Sj as: 

P{xi | Si) ~ Af(xi | fi + Asi,a 2 I p ) (5.4) 

The standard approach of estimating mixing parameters by maximum likelihood (ML) would re- 
quire the computation of marginal density P{xi). If the joint source density is P(si), then we can 
write: 

P(xi) = J P{xi | Si) P( Si ) dsi (5.5) 
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If the above integral is tractable then we can compute P{xi) and susequently the ML solution 
for [i, A and a 2 . However, we are not given any parametric form for P(si). We are simply given 
that the components of Sj are maximally independent with mean, unit-variance and unknown 
correlation structure S s . This loose specification of P{si) is the root cause of difficulty in the BSS 
problem. Even if the mixing parameters /x, A and a 2 are known the computation of posterior mean 
or maximum aposteriori (MAP) estimate of Sj would require some specification of the density of 
Si since 

P(si | Xi) oc P(xi | Si) P(si) (5.6) 
Thus it is clear that the BSS problem is non-trivial. 



5.3 Measuring dependence in BSS 

We start this subsection with some fundamental definitions from information theory followed by a 
detailed study of mutual information as a contrast function for BSS. 

Definition 5.1. Kullback-Leibler divergence: Given probability density functions P(y) and 
Q(y), the Kullback-Leibler divergence (or KL distance) between P and Q is given by: 



KL(P,Q) = J P(y)log 
For any P and Q it is true that KL(P, Q) > 0. 



Q(y) 



dy (5.7) 



Definition 5.2. Differential entropy: Given a random variable y with density P(y), the differ- 
ential entropy of y is defined to be: 



H(y) = - J P(y) log P(y)dy (5. 



Definition 5.3. Non-Gaussianity: A concept related to differential entropy is the non-Gaussianity 
(NG) of a distribution. Given agxl random vector y with density P(y), mean [i y and co- variance 
'Sy, suppose N{y | /^,S y ) is a Normal density with the same mean and co-variance as y. Then 
the non-Gaussianity of y is defined to be (see [3]): 

NG{y) = KL{P{y),M{y\ l J,y,Y,y)) (5.9) 

An important property of non-Gaussianity is invariance to invertible linear transformations i.e., 

NG(a + By) = NG(y) (5.10) 

for any non-singular matrix B and vector a. 
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Remark 5.4. Differential entropy and non-Gaussianity: Invoking the definition of KL di- 

p(v) 



vergence in 5.9, we get: 



NG{y) = / P(y) log 



dy 



After some algebraic manipulations and noting that y has co- variance S y we get: 

NG{y) = -H{y) + | log 2ne + \ log[det(E tf )] 



(5.11) 



(5.12) 



Note that in equation 5.12 q is the length of the vector y. 



Definition 5.5. Mutual information: An information theoretic measure of the dependence 
between components of a q x 1 random vector s with density P(s) is the KL distance between P(s) 
and the density of s when its components are independent i.e., P td (s) = Yij=i Pj{ s j) where Pj is 
the marginal density of Sj. This KL distance is also called the mutual information [3] between the 
components of s: 

1(a) = KL(P{s),P id {s)) (5.13) 

Substituting the definitions of KL distance and p td and simplifying we get the classical expression 
for mutual information between components of s: 



I(s)=J2H(s j )-H(s) 

i=i 



(5.14) 



Suppose the vector s is generated from another q x 1 vector z with density P(z) via a non-singular 
linear transformation: 

T 



s = w T z 



w 2 



\w q T J 



(5.15) 



then we can re-write equation 5.14 



as: 



1(a) = H( Sj ) - H(z) - log[abs(det(VT T ))] 

3=1 



(5.16) 



where H(z) is the differential entropy of z. If we are given that s satisfies E(s) = and E(ss T ) 
S s with l's on the diagonal then z must have mean and co- variance S z satisfying: 



£ s = W T Y> z W 



(5.17) 
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Taking determinants on both sides, it is easy to see that: 



abs(det(W T )) 



det(S £ 



det(£ z )2 



(5.18) 



Substituting 5.18 into 5.16 we see that: 




log[det(S s )] } + - log[det(E z )] - H(z] 



(5.19) 



dependent only on P(z) 



dependent on W T 

This equation is identical to equation (16) in |3j. This can be seen by replacing differential entropy 
by non-Gaussianity in 5.19 using 5.12 noting that individual components Sj of s have unit variance 
and using the invariance property |5.10| for non-Gaussianity. Note that the diagonal elements of 
X s are ones and so the correlation C(y) as defined in [3] is equivalent to — \ log[det(E s )]. It is 
clear that given a density P(z) for z, the second term is independent of the linear transformation 
W T . Thus the depen dence bet ween components of s is fully captured by the first term alone which 
depends on W T (see 



5.17 



and 



5.15). 



Remark 5.6. Note on the — \ log[det(S s )] term: First, note that 5] s is a correlation matrix 
with Is on the diagonal. If (f>x, (j>2, ■ ■ ■ , (/> q are its eigenvalues then it follows that 



trace(S s ) = q 
det(E s ) = 



i=l 



n 



Therefore it follows that: 

<? 

log[det(S a )] =J2 l °g<t>i 

L 



i=i 



i=l 



< glog 

= q log 
= o 



It 



i=i 



by concavity of log function 
using the trace condition in 5.20 



(5.20) 
(5.21) 

(5.22) 
(5.23) 

(5.24) 

(5.25) 
(5.26) 
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Thus we see that 



log[det(S a )] < or 
--log[det(S a )] > 



(5.27) 
(5.28) 



The minimal value of — | log[det(Xl s )] is and is attained when det(X s ) = 1. Since X! s is a 
correlation matrix with Is on the diagonal the only way det(X! s ) = 1 can hold is if S s = I q . 



The term — \ log[det(X! s )] attains a minimal value of when 5] s 



The optimization problem for minimizing dependence between components of s can therefore be 
written as: 



mm 



/(«) 



+ v 



i 



log[det(S a )] 



encourages non-Gaussianity 



> 0, encourages S s 



where: 



subject to: 



w/z, abs(det(VF T )) 



det(E s )i 
det(S z )i 



W 1 V Z W 



Since the sources Sj have unit variance, using the property 5.12 we get: 



(5.29) 

(5.30) 

(5.31) 

(5.32) 
(5.33) 



NG( 8j ) = -H{ Sj ) + -log27re 



(5.34) 



Minimizing f(s) is therefore a problem of minimizing the sum of 2 different terms. The first 
term measures the differential entropy of each component Sj. As seen from 5.34 minimizing 
the differential entropy under the unit variance constraint is equivalent to maximizing the non- 
Gaussianity NG(sj). Thus minimizing the first term encourages non-Gaussianity. The second 
term measures the 2nd order cross-correlation between sources. Since this term is minimized when 
sources are uncorrelated, it encourages uncorrelatedness of the sources. The weighting constant ijj 
serves to balance the two terms in the objective function. It is instructive to note the form of the 
objective function for various values of ip: 

• When ip = 0, the 2nd term drops out. In this case, minimization of differential entropy (or 
equivalently maximization of non-Gaussianity) of sources is considered much more important 
than minimizing correlatedness. Without making any additional assumptions on the correla- 
tion structure of s, the projection vectors Wi are constrained by the relation Wi T Yl z Wi = 1. 
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When ip = 1, then minimizing 5.29 is equivalent to minimization of mutual information I(s). 
In this case, equal importance is given to maximizing non-Gaussianity and uncorrelatedness. 
As before, the projection vectors Wi satisfy Wi T Yl z Wi = 1. 

When \j) = oo, the 2nd term is forced to become at S s = I q . In this case, non-Gaussianity 
is maximized under the uncorrelated source assumption. This means that the projection 
vectors are constrained by: Wi T Jl z Wj = 5ij where 5ij = 1 if i = j and otherwise. 



The objective function for minimization in 5.29 can also be re-written as: 

min w r f(s) = j 1 + V |-^log[det(E z )]-log[abs(det(W T ))]| 



v ~ . > 0, encourages S s = I a 

encourages non-Gaussianity 

(5.35) 

where: (5.36) 



Sj = Wj T z, abs(det(^ T )) = det ^^; (5.37) 

det(E 2 s 



2 



subject to: (5.38) 
Y, S = W T ^ Z W (5.39) 



6 Solving the BSS problem 

Solving the BSS problem means estimating both the mixing parameters /j,, A and a 2 and the 
latent sources Sj given n mixed vectors x\. We will try to answer the following questions in this 
section: 

• What are the main solution approaches when the sources are assumed to be uncorrelated and 
how does the work of Hyvarinen et al. |12] relate to the problem of BSS? 

• How should the solution approach change when non-zero second order source correlation is 
allowed? 



6.1 Case I: Uncorrelated sources 

In this case, we assume that S s = I q i.e., the sources are uncorrelated. There are two main solution 
techniques in this case. 
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6.1.1 (1) The solution by Attias et al. under exact independence 



Attias et al. [lj in a seminal paper described a general solution to the linear BSS problem under 
exact independence. He assumed that each component of the vector Sj is described by a mixture 
of Gaussians (MOG) density. The logic for this was that since the MOG is a very flexible density 
(given sufficient components in the mixture), it should be able to describe more complicated and 
non-Gaussian source densities. Given the fact that components of Sj are exactly independent, we 
have 

i 

P{.Si) = l[P j (s ij ) (6.1) 

3=1 

where Sij is the jth component of vector and Pa is the marginal density of the jth component of 
Si. In other words, the source density P(si) has a very flexible parametric form of a product of q 
MOG densities. This type of a density for Sj is also called a "factorial" MOG which is a special case 
of a mixture of co-adaptive Gaussians densities. Attias et al. also showed that under this parametric 
form the integral 5.5 is analytically tractable and the density P(xi) is also a mixture of co-adaptive 
Gaussians (although not "factorial" MOG as in the case of Sj). Attias et al. also derive an exact EM 
algorithm for obtaining the ML solution for A, fi, a 2 as well as the MOG parameters for the source 
density P(si). The only limitation of this algorithm is the computational intractability for large 
problems. Attias et al. note that a 13 source mixture with each source described by a 3 component 
MOG would require 3 13 sums in each E-step, making it computationally intractable. To overcome 
this problem Attias et al. propose a variational approximation to the ML solution instead of the EM 
algorithm. In summary, an exact solution for the BSS problem under exact independence exists, 
but is computationally intractable for large problems (greater than 13 sources) and therefore one 
has to resort to approximate solutions. 



6.1.2 (2) PPCA and least squares based solution 

An alternative and much simpler approach is to use an approximation to the density of Si only 
for the purposes of computing A, /j, and a 2 . This approximation should be such that it simplifies 



computation of the integral 5.5 making the ML solution possible under the approximate density of 



S{. A density that satisfies this requirement is P(si) = N (si \ 0,I q ). Under this assumption, the 



model 5.1 simply reduces to the probabilistic PCA (PPCA) model of Tipping et al. [20]. The ML 
solution for A, fj, and noise variance a 2 under the PPCA model (when Sj is Gaussian) is known. 
Let the p x 1 vector x be the mean of observations Xi and S x be the p x p sample co-variance 
i.e., 
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1 

X — / 3Ji 

n 

i=i 

1 " 

S x / (xi — x) {xi — x) 

n. ^— » 



Suppose 



UAU T 



(6.2) 
(6.3) 

(6.4) 



is the eigen decomposition of the symmetic matrix S x . Here £/ is a p x p matrix of eigenvectors of 
S x and A is a p x p diagonal matrix containing the corresponding eigenvalues of S x . 



A 



o 





A 2 



o \ 






(6.5) 



Suppose we order the eigenvalues of S x such that Ai > A2 • ■ ■ > A p > 0. Let U q be a p x q submatrix 
of U containing the eigenvectors corresponding to the q largest eigenvalues Ai, . . . , X q and A q be 
the q x q submatrix of A with Ai, . . . , X q on the diagonal. Then as shown in Tipping et al. the ML 
solution for mixing parameters under the PPCA model is given by: 



\x = x (6.6) 



1 = U q (A, - a 2 /,) 1 / 2 Q T (6.8) 
Q T Q = /q (6.9) 

The ML solution is unique upto an arbitrary q x q orthogonal matrix Q. Following this step, we 
can estimate the sources Sj using least squares to get: 

s i = Q(A q -a 2 I q )- 1 2U q T ( Xi - A) (6.10) 

The advantage of this approach is that it decouples the estimation of mixing parameters A, [i, a 2 
from the sources Sj. This approach is used for example in Hyvarinen et al. ([13]) with a 2 = and 
Beckmann et al. (p.j) with a 2 ^ 0. 
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Now suppose 



Zi = (A q - a 2 I q ) 2 U q T (xi - /i) and 
/ Wl T \ 



Q = W q 



w 2 



T 



\w q T ; 



then 



W T Zi 



( Wi T Zi\ 
W 2 T Zi 



\W q T ZiJ 



Since Q is orthogonal the vectors Wi satisfy the constraints 



Wi wj = 5, 



'j 



(6.11) 
(6.12) 



(6.13) 



5.35 



(6.14) 
becomes 



where <5y = 1 if i = j and otherwise. Since W T is orthogonal, the second term in 
independent of W T and so the objective function f(si) reduces to a sum of the differential entropies 
of the components of s^: 



mm^T 



where: 



encourages non-Gaussianity 



(6.15) 



(6.16) 
(6.17) 



It is worth noting that since W T is constrained to be orthogonal, no apriori assumptions can be 
made on the correlation structure of Si. Thus the components of Si will have co- variance 
= W T S Zi W where is the co-variance of Zi. Now Xl Zi = I q only if a 1 = 0. This means 
that unless a 2 = 0, the components of Sj will not have unit variance and cross-correlation. 



6.1.3 Differential entropy approximations of Hyvarinen et al. 



In a seminal paper [12], Hyvarinen et al. proposed approximations to the differential entropy func- 
tion H(x) of a random variable with density P{x). The key idea in this work is to approximate 
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P(x) by a maximum entropy distribution (MED) given estimates of the expectations of m functions 
Gi of x i.e., given E x \Gi{x)} = c%. The solution to this problem is well known [6]: 



p med {x) = a exp 



(6.18) 



where oo,ai, . . . ,a m are constants. These (m + 1) constants can be solved for by simultaneously 
solving the m expectation equations under the MED density E x [Gi(x)] = c\ together with the 
normalizing equation f p med (x) dx = 1. Simultaneous solution of this (m + 1) system of non- linear 
equations is difficult. Hence, Hyvarinen et al. proposed a solution in which it is assumed that 
the density P(x) is "not very far from a Gaussian distribution". Under these conditions, using the 
simplified form of p med (x), Hyvarinen et al. derived approximations to the differential entropy H(x) 
(see pTj for details). In summary, suppose x has mean and variance a 2 . Let v be a Gaussian 
variable with the same mean and variance as x. The "near Gaussian" MED density approximation 
of Hyvarinen et al. for a single expectation constraint using a function G{x) is given by: 



med 
^Gaussian 



(x) =M{x | 0,a 2 ){l + cG(x)} 



The function G(x) is related to its normalized version G(x) by the equation: 

G(x) = -(G(x) + a\ x + «2 x 2 + 7) 



(6.19) 



(6.20) 



Here G{x) is any function of x not necessarily even or odd. The 4 constants ai,«2,7 and 5 are 
determined from the relations: 



J JV(x I 0, a 2 ) x k G(x) dx = 0, where k = 0,1,2 
Af{x I 0,a 2 )G(x)G(x) dx = 1 



(6.21) 
(6.22) 



Then the differential entropy approximation developed in Hyvarinen et al. using PGatssian( x ) fr° m 



6.19 is given by: 



H(x) « H{y) 



25 2 



{E x [G(x)} - E u [G(u)]Y 



(6.23) 



Note that the term E u [G{u)} is in case G(x) is an odd function. Substituting the differential 
entropy of a Gaussian random variable with mean and variance a 2 in 6.23 we get: 



H(x) 



1, 1,2 

- log 2vre + - log a 



' {E x [G(x)\ - E v [G(u)]} 2 



25 2 



(7 



Random variable with density J\f(u | 0, a 2 ) 
Var(x) 



(6.24) 

(6.25) 
(6.26) 
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Given a set of observed samples of x: xi,X2, ■ ■ ■ , x n , the objective function in 6.24 can be estimated 
by replacing the expected values by sample averages. 



Hyvarinen et al. proposed a solution in which the first component of §i is estimated by minimizing 
the differential entropy of the empirical density of §n = w\ T Zi over n realizations using the 

This minimization is carried out under the constraint w\ T w\ = 



from 



6.24 



approximation H(sn) 

1. The second component is estimated similarly by minimizing the differential entropy of the 
empirical density of S{2 = W2 T zi over n realizations while imposing the constraint W2 T W\ = and 
W2 T W2 = 1. In general, the mth component is estimated by minimizing the differential entropy of 
the empirical density of w m T Zi over n realizations subject to the constraint: 



w m T w m 
G T w m = 



1 



(6.27) 
(6.28) 
(6.29) 



where G = [wi, W2, ■ ■ ■ , w m _i] is a q x (m — 1) matrix with q > (m — 1). This is essentially the 
FastICA (FICA) algorithm proposed by Hyvarinen et al. [14} [13] and remains the most popular 
ICA algorithm to date. 



6.2 Case II: Correlated sources 

Sometimes the assumption of zero second order correlation between the components of si might 
be unrealistic. The modified assumption is that of maximal independence between components of 
Si under unknown and potentially non-zero correlation i.e., S s 7^ I q . If we introduce a change of 
variables: 

s* = £^s; (6.30) 
A* = £jA (6.31) 

then this case can be reduced to the standard PPCA model: 

Xi = fx + A Si + Si (6.32) 
= n + A* s* + Si (6.33) 

The transformed sources s* satisfy E(s*) = and E(s* s* T ) = I q , however maximal independence 
requirement will be imposed on §i and not on s"? . Therefore, following the development in the 
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previous section, the PPCA solution is given by: 

/} = x 



a 



1 



P 



A* = U q (A, 
Q T Q 

The least squares source estimates are given by: 



i=q+l 

a 2 I q ) l/2 Q 7 



S s 5 St = Q (A„ - a 2 I„)-2 U q T (xi 



A) 



From EH] and EH] 



§i = S s 2 Q (A q - <7%) 2 LV (aji - /} 
i 

= W* T Zi 



where 



fwf\ 

T 



w* 



\< T J 



(6.34) 
(6.35) 

(6.36) 
(6.37) 

(6.38) 

(6.39) 

(6.40) 
(6.41) 

(6.42) 



is the unmixing matrix for recovering the sources §i. Note that when the sources are not exactly 
uncorrelated, the unmixing matrix satisfies W* T W* = S s instead of W T W = I q as in the 
exactly uncorrelated case. In other words, sources Sj are extracted by non-orthogonal projections 
of vectors zi. In this case, the optimization problem can be written as (ignoring terms independent 
of W* T ): 



mui w *T f(§i) oc 



where: 



subject to: 




+ 



encourages non-Gaussianity 



ip {-log[abs(det(W T ))]} 

V v ' 

> 0, encourages S s = I q 



w* 1 zt, abs(det(VT* i )) 



T\\ _ det(£ Si ; 



det(S 5 



W* T S Z W* 



(6.43) 

(6.44) 

(6.45) 

(6.46) 
(6.47) 
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Note that 5]^ = S s only if Yl z = I q . Thus the individual components of Sj will not have unit 
variance unless a 2 = 0. In the case of potentially correlated sources, the 2nd term in the objective 
function |6.43 does not drop out. Hence sequential extraction of sources is only possible in the case 
when ip = 0. As discussed before equation 5.35, this corresponds to the case when minimization of 
differential entropy is considered to be much more important than minimizing correlatedness. We 
consider the solution of 6.43 with ip = in the correlated case. 

Since p^* T VF* = S s and the diagonal elements of S s are ones, the mth component is estimated 
by minimizing the differential entropy of the empirical density of Si m = w^ 1 zi over n realizations 
subject to the only constraint: 

= 1 (6-48) 

Since the off diagonal elements of S s are allowed to have non-zero values, no additional mutual 
orthogonality constraints are imposed on the projection vectors. An important question is "How 
do you impose non-orthogonality?". It is possible that 2 vectors and are identical, since 
the algorithm does not explicitly enforce non-orthogonality. A simple workaround is to randomly 
initialize each vector by making orthogonal to the previously estimated vectors. In our 
experiments, this normally prevents convergence to a previously found solution, but again this is not 
guaranteed. In general, one must continue to randomly initialize solution for the mth component 
until it is found to be different from the previous (to — 1) solution vectors. 



7 PMOG based BSS 

In this section, we try to answer the following questions: 

• What is the relationship between the PMOG objective function and differential entropy? 

• How can PMOG be applied to the problem of BSS? 

7.1 PMOG objective and differential entropy 

Suppose u is a random variable with density P(u). The differential entropy of u is given by: 



H(u) = - j P{u) \ogP{u)du = -E u [log P(u)] (7.1) 



The exact density P(u) ofu is not known but suppose we approximate it by a parameterized highly 
flexible density. It is well known that a mixture of Gaussians with a sufficient number of components 
R can approximate any probability density with any desired accuracy. Suppose P^° 9 (u | 7r, fi, a 2 ) 



is an i?-component MOG density as described in|4.5|and 4.6 Then we can write: 



P(u)KP™ 9 ( U \ir,fx,v 2 ) (7.2) 
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Substituting |7.2| in |7.1| we see that: 

H{u) = -E u [\ogP{u)\ « -E u [\ogP™ 9 {u | 7r,/i,er 2 )] (7.3) 
Suppose the exact density of u is not known but n independent samples of u: u\, 112, ■ . ■ , u n drawn 



as per P(u) are given. Then we can approximate the expectation in 7.3 by a sample average using 
the law of large numbers. With this approximation we get: 

H(u) « -E u [logP™ 9 (u I 7T,V,* 2 )] « -I^[l og P™>; I 7T,/X,<X 2 )] (7.4) 

1=1 

Suppose random variable u is related to another random variable z via a linear transform u = w z 
and so the samples Uj are generated by a linear transformation Uj = w T zi then we can write: 



1 " 

= if(^) « J>gP™ 9 (™ T *i I M, <x 2 )] (7.5) 



n 

i=l 



Suppose our goal is to find a projection vector w such that the differential entropy of the pro 
jection w T z is minimized given realizations of z: z±,Z2, ■ ■ ■ ,z n , Then as shown in equation 7.5 
minimizing H(w T z) is equivalent to maximizing X^=i[l°S P™ 9 \w T z^ \ 7T, /x, <r 2 )] which is es- 
sentially the log likelihood of the PMOG model. In addition to the log likelihood term, the PMOG 



objective in 4.17 also includes prior terms for tv and <r 2 . These prior terms simply prevent the 
collapse of a PMOG component density onto a single point and thus make the PMOG estimation 
well conditioned. 



7.2 Sequential source extraction in PMOG based BSS 



As discussed in the previous section, the assumption of uncorrelated sources X! s = I q results in 
the dropping out of the second term in 6.43 If the sources are correlated, S s 7^ I q then the 
second term drops out only if ip = i.e., when minimizing differential entropy is considered to be 
much more important than minimizing corr elatedness. The important point to note is that in the 
uncorrelated source case, a sequential estimation algorithm for BSS using the PMOG model always 
exists. In the potentially correlated source case, such a sequential algorithm will exist only when 
i/j = 0. 

In PMOG based BSS, we simply model the empirical density of w m T Zi by a flexible i?-component 



MOG density. As shown in |7.5[ this is equivalent to minimizing the differential entropy of the 

along with the MOG 
6.48 depending on whether we want to enforce S f 



projected points. We can use the PMOG EM algorithm [2] to estimate w r 



parameters under constraints 6.27 



or 



I q or 



S s 7^ I q respectively. Since a MOG density with sufficiently large R can accurately model any 
non-Gaussian density, we should be able to extract complex non-Gaussian source densities after 
solving the PMOG problem. Thus in PMOG based BSS: 
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We replace the variational approximation step in the work of Attias et al. [T] by an ML step 
for the PPCA model of Tipping et al. [20] to compute the mixing parameters. This step is 
similar to that of Beckmann et al. [2]. 

For latent source estimation, we use a least squares approach after mixing parameters have 
been estimated. However, we replace the objective functions based on approximation to the 
differential entropy of Hyvarinen et al. |12j with the PMOG model in which the latent sources 
are assumed to be described by an R component MOG density. 

PMOG could have a potential advantage in cases where the latent source densities are very 



complicated. By choosing a sufficiently large R in 7.5, this complicated density can be ap- 
proximated accurately by the PMOG model. 

As discussed above, sequential source estimation is possible under the PMOG model even under 
partial source dependence (for -0 = 0). Thus, the PMOG based BSS retains flexible source density 
modeling of Attias et al. [I], is computationally tractable and can be applied under partial second 
order dependence between sources. 



8 Experiments and Results 

To illustrate the performance of PMOG based BSS, we performed 2 experiments. 

• In Experiment 1, we generate several artificial "sources" using a MOG model. Next we 
create mixed data multiple times using the same "sources" but with different random mixing 
matrices. Note this is a case of "non-square" and "noise free" mixing. For each mixture, we 
run both FICA and PMOG followed by a statistical comparison of the performance of PMOG 
with FICA across multiple runs. 

• In Experiment 2, we use real 2-D pictures from a standard image repository as "sources" and 
mix them using random mixing matrices. Note that this is a case of "square" and "noise 
free" mixing. We then run FICA and PMOG on the mixed picture data and compare the 
resulting recovered sources using each method to the "true sources" by visual inspection. 

In each experiment, we used the implementation of FICA from the FastICA package of Hyvarinen et 
al. ([13 ,[16j) available from http : //www. cis .hut . f i/projects/ica/f astica/ index . shtml. 



8.1 Experimentl 

We generated synthetic data with q = 7 MOG sources using a mixture of R = 5 Gaussians. Each 
MOG source had n = 1000 sample points. For each MOG source, elements of the class fraction 
vector 7r were chosen from f7(0, 1) and then normalized to have a sum of 1. Elements of the mean 
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vector were chosen from J7(— 10, 10) and the elements of the variance vector er 2 were chosen from 
1/(1,5). 

Suppose S is the q x n matrix made up of the n realizations of the source vectors Sj, S = 
[81,82,..., s n ]. Let e n be a n x 1 vector of all l's and suppose the eigen-decomposition of the 
sample co- variance of is: 



1 " 1 

= - > J Si = ~ 

71 ^ 71 



Se„. (8.1) 



n — ' n 

U S A S U S T = 1{S- » s e n T ) {S - fi s e n T f (8.2) 

Here U s is a q x q orthogonal matrix and A s is a diagonal matrix. We construct whitened sources 
that have sample mean and identity sample co-variance as follows: 

S = Aj^ U S T (S - n s e n T ) (8.3) 

The q x n matrix S = [81,82,..., s~ n ] holds n samples of the whitened source vectors s~i . These 
whitened sources were mixed by random p x q mixing matrices and embedded into a p = 20 
dimensional space. We generated a mixture of blind sources m = 50 times, each time using a 
different mixing matrix as follows: 

Here 

A (m) 

is a random p x q mixing matrix with elements drawn from U(0, 1). The p x n matrix 
X( m ) = [x^^ , x 2 m \ . . . ,x^^] contains the mixed signal vectors in p dimensional space for the 
mth run. We analyzed each as follows: 

• Run FICA on to get q sources as rows of the q x tl matrix ^S?jca' used the 
default settings for the FICA algorithm from the FastICA package with the odd function 
G{x) = tanh(x). 

• Run orthogonal PMOG on X (m) to get q sources as rows of the q x n matrix ^pjvfOG' 
Definition 8.1. Match between two matrices 

Given two q x n matrices A and B, let pij (A, B) denote the correlation coefficient between the 
ith row of A and the jth row of B. We define the "match" between A and B as a quantity that 
measures the average value of the best absolute correlation coefficient of the rows of A with B as 
follows: 



Match 



1 

(A, B) = -^2 max [abs (A, B)\ (8.5) 



Our goal is to compare Match (^S, S^^j^j with Match (^S, -^pmog) across the m = 50 runs of 
FICA and PMOG. Since the distribution of the Match values is potentially non-Gaussian, we apply 
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a transformation to Normality to both Match (s, S^^,^ and Match (s, S^^qq^J as described 
in [21] followed by a statistical comparison of the transformed values: 

• In brief, we simply pass the empirical distribution of Match (^S, <S^tjc , a) across m through 
the inverse CDF of a Normal distribution with the same sample mean and variance. A 
similar transformation to Normality is applied to Match (^S, Sp^ OG j and this transforms 
the non-Gaussian Match values to Gaussianity. 

• Let us denote the Normally distributed Match values for FICA and PMOG using the notation 
Match^v or . ma ; ( S, S^c A J and Match^cr-maZ Spmog) respectively. We performed a 2- 



sample t-test with unequal variance to compare the mean of MatcliTVormaZ (jS, Sfica^J ana - 
Matchjv orma / ^pmog 

Results are shown in Fig. [3] PMOG produces results that are statistically significantly better than 
FICA with a p- value of ~ 1.13 x 10 -5 across the m = 50 runs. 



8.2 Experiment 2 

To illustrate the performance of PMOG based BSS on real data, we selected 2 separate sets of 
3 images from the Berkeley Segmentation Dataset and Benchmark |18j . First, each image was 
demeaned and standardized to have unit variance. In each case, the 3 identical size images were 
mixed by a random 3x3 mixing matrix A and mean vector fi whose elements were drawn from a 
M(0, 1) Normal distribution. Note that this is a case of "square" and "noise free" mixing. In both 
cases, the 3 mixed images were subjected to 3 different analyses: 

• FICA analysis with the standard orthogonality constraints on the projection vectors. We 
used the default settings for the FICA algorithm from the FastICA package with the odd 
function G(x) = tanh(x). 

• PMOG analysis with orthogonality constraints on the projection vectors. 

• PMOG analysis without orthogonality constraints on the projection vectors. 

Results are shown in Fig. [4] - [7} For the example shown in Fig. |4j FICA failed to converge 
using the default settings. Upon experimenting with various settings, we found that the FICA 
algorithm was able to converge for the data in Fig. [4] when we change the " decorrelation" approach 
to "symm" (see the FICA package for more details). The default settings of FICA resulted in 
successful convergence for the example in Fig. [6| 

It can be seen visually from Fig. [5] and Fig. [7] that the results of PMOG are better than that of 
FICA. Since there are dependencies between the intensities of corresponding pixels in the 3 images, 
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Figure 3: The top figure shows the quantities MatcliAr orma ; ( S, S FICA J (blue) and 

MatchjVormai {s , SfT^ OG ^ (red) across the m runs. The bottom figure shows the results of a 
2-sample i-test with unequal variance on the data from the top figure. The results from PMOG 
are statistically significantly better than those of FICA with a p- value of ~ 1.13 x 10~ 5 . 
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Original sources 




Figure 4: The top row shows 3 pictures of natural images i.e., the original sources. The bottom row 
shows mixed pictures obtained after mixing image intensities from each pixel of the 3 images using 
a 3 x 3 random mixing matrix and adding a random mean offset. This mixed data was analyzed 
using FICA and PMOG. Results are shown in Fig. [5j 
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FastICA (FICA) 
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Figure 5: This figure shows the results of running FICA and PMOG on the data from Fig. |4j The 
top row shows 3 estimated sources using the FICA algorithm. For this dataset the 'den' approach 
in FICA fails and hence we used the 'symm' approach. The middle row shows 3 estimated sources 
using PMOG algorithm with orthogonality constraint on the 3 projections. The bottom row shows 
estimated sources using the PMOG algorithm without imposing the orthogonality constraint on 
the 3 projections. 
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Original sources 
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Figure 6: The top row shows 3 pictures of natural images i.e., the original sources. The bottom row 
shows mixed pictures obtained after mixing image intensities from each pixel of the 3 images using 
a 3 x 3 random mixing matrix and adding a random mean offset. This mixed data was analyzed 
using FICA and PMOG. Results are shown in Fig. [7j 
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FastICA (FICA) 
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Non-orthogonal PMOC 




Figure 7: This figure shows the results of running FICA and PMOG on the data from Fig. [6j The 
top row shows 3 estimated sources using the FICA algorithm. The middle row shows 3 estimated 
sources using PMOG algorithm with orthogonality constraint on the 3 projections. The bottom row 
shows estimated sources using the PMOG algorithm without imposing the orthogonality constraint 
on the 3 projections. 
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they are not exactly independent. Thus, we see that PMOG without orthogonality constraint is 
able to achieve better source separation compared to PMOG with orthogonality constraint. 

As an example, we also show the fitted PMOG model for one of the extracted sources (the bottom 
left image in Fig. [7]) in Fig. g This fi gure shows the monotonic increase in PMOG objective 



function H(ir®, /xW, a 2{t) , w®) fromjoo 



over EM iterations t and the final fitted PMOG model 
using R = 5 Gaussian distributions. It is interesting to note that this distribution is highly non- 
Gaussian and hence approximations to negentropy used in FICA based on the "near Gaussianity" 
assumption in might not be adequate in such cases. 
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Figure 8: This figure shows the fitted PMOG model for one of the extracted sources (the bottom 
left image in Fig. [7j). Top fi gure shows the evolution of PMOG log likelihood over iterations 
illustrating the monotonic increase property of the EM algorithm for estimating the PMOG model. 
Bottom figure shows the 1-D projected density (corresponding to the estimated projection vector) 
fitted by a 5 component PMOG model. In PMOG, both the projection vector and the MOG 
parameters are jointly optimized. 



9 Discussion 

In this work, we posed the problem of estimating a projection of input data such that the projection 
is well described by an R component MOG density. We showed that it is possible to derive an EM 
algorithm for solving this problem. Since the estimation of projection vector is coupled with the 
estimation of distributional parameters, we break up the M-step into two parts: 
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In part 1 of the M-step, we estimate the distributional parameters for a fixed projection 
vector. 



• In part 2 of the M-step, we optimize for the projection vector while fixing the distributional 
parameters. 

We show that solving the M-step problem in part 2 is equivalent to finding the roots of a particular 
cubic equation for the projection vector. Since the objective in part 2 of the M-step is non- 
concave, the solution to the set of cubic equation in part 2 might converge to a minimum instead 
of a maximum. To enforce the monotonic likelihood increase property of the EM algorithm, we 



explicitly check equation 4.29 after each M-step. If this condition is not satisfied, we randomly 



re-initialize w and repeat the M-step. 
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Figure 9: A comparison of the features and estimation techniques of four different BSS algorithms. 

Next we considered the application of the PMOG model to the solution of the BSS problem. If 
the mixing process is without Gaussian noise then the estimation of mixing parameters and latent 
sources is uncoupled [14.. In this case, the BSS problem can be reduced to one of estimating an 
orthogonal matrix by using principal component analysis (PCA) as a preprocessing step. However, 
in the presence of additive Gaussian noise, the estimation of mixing parameters and latent sources 
is coupled and this makes the BSS problem a difficult one. 

Attias et al. p] developed a very general solution for the BSS problem under the assumption 
of exact independence between latent sources. In particular, [1] assumed an MOG density for 
each latent source. This implies a "factorial MOG" joint density for the vector of latent sources 
under independence. Further, pQ derived an EM algorithm for joint estimation of both the mixing 
parameters and "factorial MOG" parameters. While this work provides an exact solution to the BSS 
problem, it assumes exact independence between sources and becomes computationally intractable 
for > 13 latent sources. As a computationally tractable alternative, [T] propose a variational 
approximation to compute the BSS parameters followed by using a MAP or posterior mean estimate 
for the sources. 
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Beckmann et al. [2J proposed a BSS solution in which the estimation of mixing parameters and 
latent sources is de-coupled by assuming that the sources have joint distribution M(0,I q ). This 
effectively reduces the BSS model to the PPCA model of Tipping et al. [20]. In [2], the mixing 
parameters are chosen to be the ML solution for this PPCA model after which the BSS problem 
reduces to one of estimating an orthogonal matrix. In both [TJ] and [2] , the sources are estimated 
as orthogonal projections that minimize the sum of differential entropies of the projections as in 



equation 6.15 Both techniques |14j and [2] use approximations to the differential entropy developed 



in |12j based on the assumption of "near Gaussian" source densities. 

While the solution by Attias et al. pQ is very attractive (at least under exact independence) since 
it allows flexible source density modeling via a "factorial MOG", it suffers from computational 
intractability for > 13 sources. Our work bridges the gap between the approach of Attias et al. pQ 
and that of Hyvarinen et al. [Tl] and Beckmann et al. [2J. On the one hand, it retains computational 
tractability by using the PPCA approach of Beckmann et al. and on the other hand it allows for 



flexible source density modeling of Attias et al. via the PMOG model. As we show in 7.5 , minimizing 
the differential entropy is equivalent to maximizing the PMOG likelihood function. In the PMOG 
model, we jointly estimate both the projection vector as well as MOG distributional parameters to 
maximize the likelihood of observing an MOG in the projected space. The overall algorithm retains 
computational feasibility for sources > 13 while retaining the flexibility of modeling arbitrarily 
complex source densities (given enough Gaussians in PMOG). 

In the work of Attias et al. [1] an approximation is made to true ML objective (based on "factorial 
MOG" sources) using variational arguments to retain computational tractability for > 13 sources. 
In the present work and in Beckmann et al. [2J, exact ML solution is used for the analytically 
tractable PPCA model (based on M(0,I q ) sources). Both methods use approximations. Is the 
variational approximation more accurate than the PPCA approximation for BSS parameters? This 
question is open for discussion and beyond the scope of this work. However, based on our experi- 
ments with real and synthetic data the PPCA approximation followed by application of the PMOG 
results in good performance. We also show that when the latent sources are only approximately 
independent, then we should simply run the PMOG algorithm with only the unit norm constraints 
on the projection vectors (i.e. without the orthogonality constraints). Thus PMOG based BSS 
generalizes elegantly to cases where non-zero second order correlation exists between sources. Fig. 
[9] shows a comparison of PMOG based BSS with alternative approaches. 

It is worth noting the difference between the BSS algorithms in Fig. [9] and a related technique 
from statistics - projection pursuit (PP) EH IE]- In BSS, the problem is to estimate both the 
mixing parameters and latent sources given the linear mixing model [oTT| As shown in [1], this is a 
coupled estimation problem. However, if we use a PPCA step to estimate mixing parameters then 
this problem becomes decoupled and the latent sources can be estimated using least squares upto a 
linear transformation. If sources are assumed to be uncorrelated then this linear transformation is 
orthogonal. At this point, the problem becomes similar to a PP problem, where linear transforma- 
tions (not necessarily orthogonal) are sought that optimize general contrast functions. Whereas in 
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PP any contrast function can be used, the focus in BSS is to maximize the independence between 
sources and so a linear transformation that minimizes the mutual information between the sources 
is sought. Thus PMOG based BSS can also be thought of as a combination of PPCA preprocess- 
ing followed by PP using the special PMOG objective function which is optimized using an EM 
algorithm with or without orthogonality conditions on the projections. 

Both FICA and PMOG solve an optimization problem where different objective functions are 
maximized. In the case of FICA, the objective function is an approximation to differential entropy 
developed in |12j whereas in the case of PMOG the objective function is the PMOG log likelihood 



7.5 In both FICA and PMOG, the objective functions are not concave and so the solutions to 
the maximization problems are not unique. An ideal solution would be to find the global solution 
to the maximization problems in FICA and PMOG and compare the results. Since finding the 
global solution itself is a difficult problem an alternative suboptimal method of comparing FICA 
and PMOG is to run each algorithm multiple times on mixtures generated using the same "sources" 
and then compare the quality of results across runs. This is the approach we used in Experiment 1 



8.1 It was found that on average PMOG produced significantly better results compared to FICA 
across runs. We would like to mention that since the true "sources" in Experiment 1 are in fact 
MOG sources, the results could be slightly biased towards PMOG. Nevertheless, Experiment 1 does 
illustrate that when the true "sources" have a complicated density (such as a MOG) then PMOG 
might show improved performance. 

We also ran 2 illustrative experiments on publicly available natural image datasets [18] in Exper- 



iment 2 8.2 and compared the results of FICA and PMOG by visual inspection. FICA showed 
relatively poor performance compared to orthogonal PMOG. We postulate that this is because 
the source densities are multimodal (see Fig. [8]) and so the PMOG model captures them more 
faithfully compared to the approximate cost function [12] used in FICA. We also noticed that as- 
sumption of orthogonality of projection vectors also hurts the separation performance because the 
source images have non-zero second order correlation and hence are not exactly independent. The 
best performing algorithm was non-orthogonal PMOG in which maximal source independence is 
enforced with potentially non-zero second order correlation (i.e., a minimal level of dependence is 
allowed) . 

A limitation of PMOG based BSS is the running time of the PMOG algorithm. It is clear from 
numerical experiments that PMOG is much slower than FICA. For the experiments presented in 
this work, PMOG takes upto 7 minutes per source estimation versus only a fraction of a second 
for FICA using a computer with 2 x 2.93 GHz Quad-Core Intel Xeon processor with 16 GB RAM. 
Another drawback is the potential re-initialization required to enforce "non-orthogonality" in the 
form of PMOG where partial source dependence is allowed. A natural solution to this problem 



is to perform a joint PMOG estimation with ij) ^ in 6.43 which would automatically encourage 
orthogonality while not explicitly enforcing it. 



Finally, note that the approximation in 7.5 becomes increasingly accurate as n increases and thus 



ideally we would like the number of available data samples n to be large. However, this dependence 
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of the quality of approximation on n is not unique to PMOG based BSS. For instance, even in the 
differential entropy approximations of Hyvarinen et al. |12j such as 6.24, the expectations on the 



right hand side are approximated by sample averages which also become increasingly accurate as 
n increases. 



10 Conclusions 

We propose that the non-square linear BSS model with Gaussian noise can be estimated by first 
using the PPCA model of Tipping et al. [20] followed by application of the PMOG algorithm. 
The MOG density provides a flexible model for the unknown source density and simulations on 
illustrative data sets indicate that this approach might be a useful alternative to well established 
approaches such FICA. Furthermore, it is possible to allow for non-zero second order correlation 
between latent sources simply by relaxation of the orthogonality requirement in PMOG. This could 
result in better performance since the assumption of exact statistical independence is unlikely to 
be true for real world data-sets. 



The current version of PMOG based BSS solves the optimization problem in 6.43 for -0 = since 
this decouples the estimation of individual projection vectors. Future work would involve the joint 
solution of all projection vectors using a PMOG approach for tp = 1. 
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